We’ve imported all of the metadata for each congressional member
(House, party, etc.), the trading disclosures downloaded from the
QuiverQuant API, and the stock data pulled using tidyquant. I
added two columns: report_lag, the number of days between the
transaction and its disclosure, and overdue, the number of days
that lag runs past the 45-day limit (0 if reported on time).
We’ll take a glimpse at a few random rows from the dataset below:
| report_date | trans_date | Ticker | Representative | Transaction | Amount | House | Range | report_lag | overdue | DNR | first_name | last_name | party |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-05-08 | 2020-04-14 | CTSH | David Perdue | Sale | 15001 | Senate | $15,001 - $50,000 | 24 | 0 | 0 | David | Perdue | R |
| 2019-12-14 | 2019-11-26 | DAL | Donald Beyer | Purchase | 1001 | Representatives | $1,001-$15,000 | 18 | 0 | 0 | Donald | Beyer | D |
| 2018-07-06 | 2017-07-03 | OKE | David Perdue | Sale | 15001 | Senate | $15,001 - $50,000 | 368 | 323 | 0 | David | Perdue | R |
| 2018-05-23 | 2018-03-28 | GE | Carlos Curbelo | Purchase | 1001 | Representatives | $1,001-$15,000 | 56 | 11 | 0 | Carlos | Curbelo | R |
| 2020-07-24 | 2020-06-29 | BAM | James Inhofe | Sale | 1001 | Senate | $1,001 - $15,000 | 25 | 0 | 0 | James | Inhofe | R |
| 2017-12-06 | 2017-09-08 | PKI | David Perdue | Sale | 1001 | Senate | $1,001 - $15,000 | 89 | 44 | 0 | David | Perdue | R |
| 2017-03-29 | 2017-03-07 | GM | James Inhofe | Purchase | 50001 | Senate | $50,001 - $100,000 | 22 | 0 | 0 | James | Inhofe | R |
| 2018-11-19 | 2018-10-01 | INTC | Katherine Clark | Purchase | 1001 | Representatives | $1,001-$15,000 | 49 | 4 | 0 | Katherine | Clark | D |
| 2018-09-29 | 2018-07-12 | MET | Gary Palmer | Purchase | 1001 | Representatives | $1,001-$15,000 | 79 | 34 | 0 | Gary | Palmer | R |
| 2020-07-15 | 2020-07-01 | RUN | Alan Lowenthal | Sale | 40 | Representatives | $40.00 | 14 | 0 | 0 | Alan | Lowenthal | D |
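As a sanity check, both derived columns can be recomputed from the two date columns alone. A minimal sketch in base R, using the dates from the first and third rows of the table above:

```r
# report_lag: days between the transaction and its disclosure
# overdue: days past the 45-day reporting window (0 if reported on time)
report_date <- as.Date(c("2020-05-08", "2018-07-06"))
trans_date  <- as.Date(c("2020-04-14", "2017-07-03"))

report_lag <- as.numeric(report_date - trans_date)
overdue    <- pmax(report_lag - 45, 0)

report_lag  # 24 368
overdue     # 0 323
```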
Disclosing a stock transaction after the imposed 45-day requirement should be a red flag, as those engaged in ‘suspicious’ trading might be unwilling to report a transaction right away. Then again, we may find that someone who has acted on insider information is more willing to abide by the rules so that their behavior doesn’t seem suspicious at all. In any event, let’s look at the average reporting lag for each congressional member.
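The per-member summary behind that comparison can be sketched with dplyr; `trades_toy` below is a toy stand-in for the full trades table, and `mean_lag`/`trans_count` are illustrative column names:

```r
library(dplyr)

# Toy stand-in for the full disclosure table
trades_toy <- data.frame(
  Representative = c("A. Smith", "A. Smith", "B. Jones", "B. Jones"),
  report_lag     = c(10, 90, 30, 50)
)

member_lags <- trades_toy %>%
  group_by(Representative) %>%
  summarize(mean_lag    = mean(report_lag),
            trans_count = n()) %>%
  arrange(desc(mean_lag))

member_lags  # A. Smith averages 50 days, B. Jones 40
```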
There are not only quite a few congressional members who are consistently delinquent in reporting, but also many who are very delinquent. We might conjecture that there is a relationship between the delinquency of a member and how frequently they trade.
In terms of political parties, Democrats take almost 10 days longer on average to report their trades. Both parties’ averages are nearly double the 45-day deadline mandated by the STOCK Act.
However, that doesn’t seem to be the case. A log transformation of both variables doesn’t seem to help either.
We could also look at the average reporting lag for individual stocks to see if there are some ticker transactions which people are more likely to report later than others. First, we can sort the tickers in terms of descending mean reporting lag and plot all of the values. I’ve removed the Ticker labels since they are difficult to display with 800 different tick marks, but the plot is interactive, so zoom in on a section and hover over the points to see the symbol.
Interesting for sure. Let’s filter further to see all of the stocks that, on average, were subject to delinquent reporting by congressional members. Let’s set a fairly high threshold of 250 days after transaction (DAT) to cut down on the observations. The red dotted line represents the 45 DAT reporting deadline.
If you hover over the points you can see the average lag associated with each one; however, the transaction count for most of these stocks is also very low. Let’s cross-check the top 100 highly delinquent stocks to see whether some of these are also among the 100 most popular across congressional members.
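The `ticker_lags` table used in the next chunk can be built with a per-ticker summary like the following sketch (the real chunk isn’t shown; the column names match the join keys used below):

```r
library(dplyr)

# Toy stand-in; the real analysis groups the full trades table by Ticker
trades_toy <- data.frame(
  Ticker     = c("GE", "GE", "DAL"),
  report_lag = c(56, 14, 18)
)

ticker_lags_toy <- trades_toy %>%
  group_by(Ticker) %>%
  summarize(mean_lag    = mean(report_lag),
            sd_lag      = sd(report_lag),
            trans_count = n()) %>%
  arrange(desc(mean_lag))
```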
trade_delin_intersection <- ticker_lags %>%
  top_n(100, trans_count) %>%
  inner_join(ticker_lags %>% top_n(100, mean_lag),
             by = c('Ticker', 'mean_lag', 'sd_lag', 'trans_count'))
| Ticker | mean_lag | sd_lag | trans_count | |
|---|---|---|---|---|
| 1 | NA | NA | NA | NA |
Alas, it seems like there are no overlapping tickers for these subsets of the data.
Take a look at the graphs below. In general, the higher a stock’s average delinquency, the lower its transaction count, although the relationship is not linear. We can try to model this later.
The density plot below shows the log scaled reporting lag for both houses and parties. The vertical red lines represent the (log) 45 day reporting deadline. The vast majority of transactions get reported on time, but it seems that there is some disparity between parties at the extreme end, at least for the Senate.
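A sketch of how such a density plot can be built (the actual chunk isn’t shown; the fill-by-party and facet-by-House layout is an assumption based on the description above):

```r
library(ggplot2)

# Toy stand-in for the trades table
trades_toy <- data.frame(
  report_lag = c(5, 12, 30, 47, 120, 400, 9, 22, 51, 88),
  party      = rep(c("D", "R"), 5),
  House      = rep(c("Senate", "Representatives"), each = 5)
)

lag_density <- ggplot(trades_toy, aes(x = log(report_lag), fill = party)) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = log(45), color = "red") +  # log of the 45-day deadline
  facet_wrap(~House) +
  labs(x = "log(report_lag)", y = "Density")
```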
Let’s take a look at the total transaction amounts for each of the tickers in our database that have a total transaction count greater than 50.
## `summarise()` has grouped output by 'as.character(trans_date)'. You can
## override using the `.groups` argument.
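The chunk producing that summary isn’t shown; judging from the message above it also groups by transaction date. One plausible shape, filtering to heavily traded tickers first (toy data and the `n() > 1` threshold stand in for the real table and its threshold of 50):

```r
library(dplyr)

trades_toy <- data.frame(
  Ticker     = c("AAPL", "AAPL", "GE"),
  trans_date = as.Date(c("2020-01-02", "2020-01-02", "2020-01-03")),
  Amount     = c(1001, 15001, 1001)
)

# Keep only tickers above the transaction-count threshold
popular <- trades_toy %>%
  group_by(Ticker) %>%
  filter(n() > 1)

# Total transaction amount per ticker per day
ticker_amounts <- popular %>%
  group_by(trans_date, Ticker) %>%
  summarize(total_amount = sum(Amount), .groups = "drop")
```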
We can try to identify some preliminary clusters of trading members
by the securities that they have invested in over the timeframe we have
data for. Let’s start with a simple unweighted matrix with a
1 for each security a member has held and a 0 for each
security they have not held. Since our values are binary and we don’t
want to destroy the sparsity of the matrix, we will not center or scale
the data at all.
trading_members <- unique(trades$Representative)
i <- c()
j <- c()
x <- c()
for (ind in seq_along(trading_members)) {
  df <- filter(trades, Representative == trading_members[ind])
  unique_tickers <- unique(df$Ticker)
  j_ind <- which(tickers %in% unique_tickers)
  i_ind <- rep(ind, times = length(j_ind))
  values <- rep(1, times = length(j_ind))
  i <- c(i, i_ind)
  j <- c(j, j_ind)
  x <- c(x, values)
}
A1 <- sparseMatrix(i, j, x = x)
Our matrix has 167 rows and 818 columns, corresponding to the
number of trading members and unique tickers, respectively. Let’s check
out the structure of A1.
str(A1)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:3334] 6 7 8 9 10 11 12 13 14 15 ...
## ..@ p : int [1:819] 0 12 14 88 90 92 93 94 96 97 ...
## ..@ Dim : int [1:2] 167 818
## ..@ Dimnames:List of 2
## .. ..$ : NULL
## .. ..$ : NULL
## ..@ x : num [1:3334] 1 1 1 1 1 1 1 1 1 1 ...
## ..@ factors : list()
We can create another sparse matrix with the number of transactions a member has made in each stock, expressed as a proportion of that member’s total transaction count. The nonzero elements in each column will be extracted, zero centered, and scaled so that the mean of each column is near zero while still maintaining sparsity.
member_totals <- trades %>%
group_by(Representative) %>%
summarise(total_count = n())
member_trades <- trades %>%
group_by(Representative, Ticker) %>%
summarize(n = n()) %>%
left_join(member_totals, by = "Representative") %>%
rowwise() %>%
mutate(pr_trans = n/total_count)
## `summarise()` has grouped output by 'Representative'. You can override using
## the `.groups` argument.
i <- c()
j <- c()
x <- c()
for (ind in seq_along(trading_members)) {
  df <- filter(member_trades, Representative == trading_members[ind])
  unique_tickers <- unique(df$Ticker)
  j_ind <- which(trade_tickers %in% unique_tickers)
  i_ind <- rep(ind, times = length(j_ind))
  values <- df$pr_trans
  i <- c(i, i_ind)
  j <- c(j, j_ind)
  x <- c(x, values)
}
A2 <- sparseMatrix(i, j, x = x)
for (i in 1:ncol(A2)) {
  vec <- A2[, i]
  ind <- which(vec != 0)
  subset <- vec[ind]
  minv <- min(subset)
  maxv <- max(subset)
  if (length(subset) > 1 && sd(subset) > 0) {
    #A2[ind, i] <- (subset - minv) / (maxv - minv) * 2 - 1
    A2[ind, i] <- scale(subset)
  } else if (length(subset) > 1 && sd(subset) == 0) {
    A2[ind, i] <- 1
  } else if (length(subset) == 1 && all(subset > 1)) {
    A2[ind, i] <- 1
  }
}
Then we can create a sparse matrix with each stock’s proportion of the total transaction amount (purchases and sales) in each member’s portfolio. The nonzero elements in each column will be scaled as described above.
member_totals <- trades %>%
group_by(Representative) %>%
summarize(total_amt = sum(Amount))
member_ticker_weights <- trades %>%
group_by(Representative, Ticker) %>%
summarize(ticker_amt = sum(Amount)) %>%
full_join(member_totals, by = 'Representative') %>%
rowwise() %>%
mutate(pr_weight = ticker_amt/total_amt)
## `summarise()` has grouped output by 'Representative'. You can override using
## the `.groups` argument.
i <- c()
j <- c()
x <- c()
for (ind in seq_along(trading_members)) {
  df <- filter(member_ticker_weights, Representative == trading_members[ind])
  unique_tickers <- unique(df$Ticker)
  j_ind <- which(trade_tickers %in% unique_tickers)
  i_ind <- rep(ind, times = length(j_ind))
  values <- df$pr_weight
  i <- c(i, i_ind)
  j <- c(j, j_ind)
  x <- c(x, values)
}
A3 <- sparseMatrix(i, j, x = x)
for (i in 1:ncol(A3)) {
  vec <- A3[, i]
  ind <- which(vec != 0)
  subset <- vec[ind]
  maxv <- max(vec)
  minv <- min(vec)
  if (length(subset) != 1 && sd(subset) != 0) {
    #A3[ind, i] <- (subset - minv) / (maxv - minv) * 2 - 1
    A3[ind, i] <- scale(subset)
  }
}
Next, we’ll create sparse representations of the data from a date perspective. For each date on which a member made a transaction, we record the total value of that day’s transactions as a proportion of the total value of all transactions that member made. Columns will be centered and scaled.
date_set <- unique(trades$trans_date)
i <- c()
j <- c()
x <- c()
for (ind in seq_along(trading_members)) {
  df <- filter(trades, Representative == trading_members[ind]) %>%
    group_by(trans_date) %>%
    summarize(total_amount = sum(Amount),
              count = n(),
              avg_trans = total_amount / count)
  total <- sum(df$total_amount)
  df$prop <- df$total_amount / total
  unique_dates <- unique(df$trans_date)
  j_ind <- which(date_set %in% unique_dates)
  i_ind <- rep(ind, times = length(j_ind))
  values <- df$prop
  i <- c(i, i_ind)
  j <- c(j, j_ind)
  x <- c(x, values)
}
A4 <- sparseMatrix(i, j, x = x)
for (i in 1:ncol(A4)) {
  vec <- A4[, i]
  ind <- which(vec != 0)
  subset <- vec[ind]
  maxv <- max(vec)
  minv <- min(vec)
  if (length(subset) != 1 && sd(subset) != 0) {
    A4[ind, i] <- scale(subset)
    #A4[ind, i] <- (subset - minv) / (maxv - minv)
  } else if (length(subset) != 1 && sd(subset) == 0) {
    A4[ind, i] <- 1
  } else if (length(subset) == 1) {
    A4[ind, i] <- 1
  }
}
Matrix A5, shown below, records the number of transactions each member made on each day as a proportion of the total number of transactions they made over the data period. Columns will be centered and scaled.
date_set <- unique(trades$trans_date)
i <- c()
j <- c()
x <- c()
for (ind in seq_along(trading_members)) {
  df <- filter(trades, Representative == trading_members[ind]) %>%
    group_by(trans_date) %>%
    count()
  total_trades <- sum(df$n)
  df$prop_trades <- df$n / total_trades
  unique_dates <- unique(df$trans_date)
  j_ind <- which(date_set %in% unique_dates)
  i_ind <- rep(ind, times = length(j_ind))
  values <- df$prop_trades
  i <- c(i, i_ind)
  j <- c(j, j_ind)
  x <- c(x, values)
}
A5 <- sparseMatrix(i, j, x = x)
for (i in 1:ncol(A5)) {
  vec <- A5[, i]
  ind <- which(vec != 0)
  subset <- vec[ind]
  maxv <- max(vec)
  minv <- min(vec)
  if (length(subset) != 1 && sd(subset) != 0) {
    A5[ind, i] <- scale(subset)
    #A5[ind, i] <- (subset - minv) / (maxv - minv)
  } else if (length(subset) != 1 && sd(subset) == 0) {
    A5[ind, i] <- 1
  } else if (length(subset) == 1) {
    A5[ind, i] <- 1
  }
}
Finally, our last matrix records the total transaction amount for each member on every day that they made a transaction. The nonzero elements are then centered and scaled as before.
i <- c()
j <- c()
x <- c()
for (ind in seq_along(trading_members)) {
  df <- filter(trades, Representative == trading_members[ind]) %>%
    group_by(trans_date) %>%
    summarize(total_amount = sum(Amount))
  unique_dates <- unique(df$trans_date)
  j_ind <- which(date_set %in% unique_dates)
  i_ind <- rep(ind, times = length(j_ind))
  values <- df$total_amount
  i <- c(i, i_ind)
  j <- c(j, j_ind)
  x <- c(x, values)
}
A6 <- sparseMatrix(i, j, x = x)
for (i in 1:ncol(A6)) {
  vec <- A6[, i]
  ind <- which(vec != 0)
  subset <- vec[ind]
  maxv <- max(vec)
  minv <- min(vec)
  if (length(subset) != 1 && sd(subset) != 0) {
    A6[ind, i] <- scale(subset)
    #A6[ind, i] <- (subset - minv) / (maxv - minv)
  } else if (length(subset) != 1 && sd(subset) == 0) {
    A6[ind, i] <- 1
  } else if (length(subset) == 1) {
    A6[ind, i] <- 1
  }
}
When working with sparse matrices, there isn’t always a clear-cut way
to reduce the dimensionality. We tried several of the algorithms from
the sparsepca package, but found that they were a bit slow
and the results were not always meaningful for our datasets.
Instead, we used the sparsesvd function from the package of
the same name. We calculated the first twenty principal components and
ran the DBSCAN algorithm on those components only. For DBSCAN, we set
the minPts argument to n+1 and used a kNN
distance plot to pick a decent value for \(\epsilon\).
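The SVD and clustering chunk itself isn’t shown, so here is a minimal, self-contained sketch of that pipeline on a random sparse matrix. The package calls are as named in the text; the quantile-based choice of \(\epsilon\) stands in for reading it off the kNN distance plot, and taking n to be the embedding dimension is an assumption:

```r
library(Matrix)      # sparse matrix construction
library(sparsesvd)   # truncated SVD for sparse matrices
library(dbscan)      # DBSCAN and kNN distance utilities

set.seed(1)
# Toy stand-in for a member-by-ticker matrix (30 "members", 20 "tickers")
A_toy <- rsparsematrix(30, 20, density = 0.2, rand.x = function(n) rep(1, n))

# Truncated SVD; the member scores are u %*% diag(d)
svd_fit <- sparsesvd(A_toy, rank = 5)
scores  <- svd_fit$u %*% diag(svd_fit$d)

# minPts = n + 1, reading n as the embedding dimension (an assumption)
k <- ncol(scores) + 1
# eps would normally be read off kNNdistplot(scores, k); a quantile stands in
eps <- quantile(kNNdist(scores, k = k), 0.9)

clusters <- dbscan(scores, eps = eps, minPts = k)
clusters$cluster  # cluster 0 marks DBSCAN noise points ("outliers")
```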
A1knplot <- ggplot(as.data.frame(as.matrix(A1spc)), aes(x = V1, y = V2, color = factor(A1km_naive$cluster),text = trading_members))+
geom_point(alpha=.8)+
scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
labs(x = 'sPC1',
y = 'sPC2',
color = 'Kmeans\nCluster',
title = 'A1 - Portfolio sPCA with K-means Clustering')
ggplotly(A1knplot)
A1kplot <- ggplot(as.data.frame(as.matrix(A1spc)), aes(x = V1, y = V2, color = factor(A1km$cluster),text = trading_members))+
geom_point(alpha=.8)+
scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
labs(x = 'sPC1',
y = 'sPC2',
color = 'sK-means\nCluster',
title = 'A1 - Portfolio sPCA with Spherical K-means Clustering')
ggplotly(A1kplot)
A1plot <- ggplot(as.data.frame(as.matrix(A1spc)), aes(x = V1, y = V2, color = factor(A1clust$cluster),text = trading_members))+
geom_point(alpha=.8)+
scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
labs(x = 'sPC1',
y = 'sPC2',
color = 'DBSCAN\nCluster',
title = 'A1 - Portfolio sPCA')
ggplotly(A1plot)
A2kplot <- ggplot(as.data.frame(as.matrix(A2spc)), aes(x = V1, y = V2, color = factor(A2km$cluster), text = trading_members))+
geom_point(size = 2, alpha = .6)+
scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
labs(x = 'sPC1',
y = 'sPC2',
color = 'kMeans\nCluster',
title = 'A2 - Ticker Transaction Count Proportion sPCA')
ggplotly(A2kplot)
A2plot <- ggplot(as.data.frame(as.matrix(A2spc)), aes(x = V1, y = V2, color = factor(A2clust$cluster), text = trading_members))+
geom_point(size = 2, alpha = .6)+
scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
labs(x = 'sPC1',
y = 'sPC2',
color = 'DBSCAN\nCluster',
title = 'A2 - Ticker Transaction Count Proportion sPCA')
ggplotly(A2plot)
A3kplot <- ggplot(as.data.frame(as.matrix(A3spc)), aes(x = V1, y = V2, color = factor(A3km$cluster), text = trading_members))+
geom_point(alpha = .8)+
scale_color_viridis_d(option ='D', begin = .4, end = .8)+
labs(x = 'sPC1',
y = 'sPC2',
color = 'kMeans\nCluster',
title = 'A3 - Ticker Transaction Amount Proportion sPCA')
ggplotly(A3kplot)
A3plot <- ggplot(as.data.frame(as.matrix(A3spc)), aes(x = V1, y = V2, color = factor(A3clust$cluster), text = trading_members))+
geom_point(alpha = .8)+
scale_color_viridis_d(option ='D', begin = .4, end = .8)+
labs(x = 'sPC1',
y = 'sPC2',
color = 'DBSCAN\nCluster',
title = 'A3 - Ticker Transaction Amount Proportion sPCA')
ggplotly(A3plot)
A4kplot <- ggplot(as.data.frame(as.matrix(A4spc)), aes(x = V1, y = V2, color = factor(A4km$cluster), text = trading_members))+
geom_point(alpha = .8)+
scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
labs(x = 'sPC1',
y = 'sPC2',
color = 'kMeans\nCluster',
title = 'A4 - Daily Transactional Value Proportion sPCA')
ggplotly(A4kplot)
A4plot <- ggplot(as.data.frame(as.matrix(A4spc)), aes(x = V1, y = V2, color = factor(A4clust$cluster), text = trading_members))+
geom_point(alpha = .8)+
scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
labs(x = 'sPC1',
y = 'sPC2',
color = 'DBSCAN\nCluster',
title = 'A4 - Daily Transactional Value Proportion sPCA')
ggplotly(A4plot)
A5kplot <- ggplot(as.data.frame(as.matrix(A5spc)), aes(x = V1, y = V2, color = factor(A5km$cluster),text = trading_members))+
geom_point(alpha = .8)+
scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
labs(x = 'sPC1',
y = 'sPC2',
color = 'kMeans\nCluster',
title = 'A5 - Daily Transactional Count Proportion sPCA')
ggplotly(A5kplot)
A5plot <- ggplot(as.data.frame(as.matrix(A5spc)), aes(x = V1, y = V2, color = factor(A5clust$cluster),text = trading_members))+
geom_point(alpha = .8)+
scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
labs(x = 'sPC1',
y = 'sPC2',
color = 'DBSCAN\nCluster',
title = 'A5 - Daily Transactional Count Proportion sPCA')
ggplotly(A5plot)
A6kplot <- ggplot(as.data.frame(as.matrix(A6spc)), aes(x = V1, y = V2, color = factor(A6km$cluster), text = trading_members))+
geom_point(alpha = .8)+
scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
labs(x = 'sPC1',
y = 'sPC2',
color = 'kMeans\nCluster',
title = 'A6 - Daily Transaction Amount sPCA')
ggplotly(A6kplot)
A6plot <- ggplot(as.data.frame(as.matrix(A6spc)), aes(x = V1, y = V2, color = factor(A6clust$cluster), text = trading_members))+
geom_point(alpha = .6, size = 2)+
scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
labs(x = 'sPC1',
y = 'sPC2',
color = 'DBSCAN\nCluster',
title = 'A6 - Daily Transaction Amount sPCA')
ggplotly(A6plot)
cosinesim_calc <- function(sparse_matrix, clusters, id) {
  dim1 <- dim(sparse_matrix)[1]
  data_df <- data.frame(matrix(cosSparse(t(sparse_matrix)), nrow = dim1, ncol = dim1))
  colnames(data_df) <- id
  rownames(data_df) <- id
  member_df <- data.frame(id = id, cluster = clusters)
  cl_list <- unique(clusters)
  res <- list(cos_sim = data_df, members = member_df)
  dput(res, file = paste('output/', deparse(substitute(sparse_matrix)), '_clustering.R', sep = ''))
}
cosinesim_calc(A1, A1clust$cluster, id = trading_members)
cosinesim_calc(A2, A2clust$cluster, id = trading_members)
cosinesim_calc(A3, A3clust$cluster, id = trading_members)
cosinesim_calc(A4, A4clust$cluster, id = trading_members)
cosinesim_calc(A5, A5clust$cluster, id = trading_members)
cosinesim_calc(A6, A6clust$cluster, id = trading_members)
A1_clustering <- dget('../analysis/output/A1_clustering.R')
A2_clustering <- dget('../analysis/output/A2_clustering.R')
A3_clustering <- dget('../analysis/output/A3_clustering.R')
A4_clustering <- dget('../analysis/output/A4_clustering.R')
A5_clustering <- dget('../analysis/output/A5_clustering.R')
A6_clustering <- dget('../analysis/output/A6_clustering.R')
A1_outliers <- A1_clustering$members %>%
filter(cluster == 0)
A2_outliers <- A2_clustering$members %>%
filter(cluster == 0)
A3_outliers <- A3_clustering$members %>%
filter(cluster == 0)
A4_outliers <- A4_clustering$members %>%
filter(cluster == 0)
A5_outliers <- A5_clustering$members %>%
filter(cluster == 0)
A6_outliers <- A6_clustering$members %>%
filter(cluster == 0)
outlier_members <- unique(c(A1_outliers$id, A2_outliers$id, A3_outliers$id, A4_outliers$id, A5_outliers$id, A6_outliers$id))
outliers_df <- data.frame(members = outlier_members,
A1 = ifelse(outlier_members %in% A1_outliers$id, 1, 0),
A2 = ifelse(outlier_members %in% A2_outliers$id, 1, 0),
A3 = ifelse(outlier_members %in% A3_outliers$id, 1, 0),
A4 = ifelse(outlier_members %in% A4_outliers$id , 1, 0),
A5 = ifelse(outlier_members %in% A5_outliers$id, 1, 0),
A6 = ifelse(outlier_members %in% A6_outliers$id, 1, 0)) %>%
rowwise() %>%
mutate(count = sum(c_across(A1:A6), na.rm = T))
out_plot <- ggplot(outliers_df, aes(x = reorder(members, -count), y = count, fill = factor(count)))+
geom_bar(stat = 'identity')+
scale_fill_viridis_d(option = 'D', begin = .2, end = .8)+
theme(axis.text.x = element_text(angle = 90, vjust=.5))+
labs(title = 'Outlier Appearance Count from DBSCAN Clustering',
x = 'Congressional Member',
y = 'Outlier Flag Count',
fill = 'Flag Count')
ggplotly(out_plot)
apply(outliers_df[,2:7], 2, sum)
## A1 A2 A3 A4 A5 A6
## 13 8 7 12 17 23
dput(outliers_df, 'outliers_df.R')
mat_list <- list(A1, A2, A3, A4, A5, A6)
res_mat <- matrix(nrow = 167, ncol = 6)
for(i in 1:length(mat_list)){
i_ds <- vegdist(mat_list[[i]], 'euc')
i_mean <- apply(as.matrix(i_ds), 2, mean)
res_mat[,i] <- i_mean
}
colnames(res_mat) <- c('A1', 'A2', 'A3', 'A4', 'A5', 'A6')
res_df <- data.frame(res_mat) %>%
mutate(member = trading_members) %>%
rowwise() %>%
mutate(mean_ds = sum(A1, A2, A3, A4, A5, A6)/6) %>%
select(member, everything())
ordered_dis <- res_df[order(-res_df$mean_ds),]
dput(ordered_dis, 'dissim.R')
dnr_mem <- trades %>%
filter(DNR == 1) %>%
group_by(Representative) %>%
summarize(transaction_count = n(),
transaction_amount = sum(Amount))
dput(dnr_mem, 'dnr_mem.R')
del_mem <- trades %>%
filter(report_lag > 45) %>%
group_by(Representative) %>%
summarize(transaction_count = n(),
transaction_amount = sum(Amount),
lag = mean(report_lag)) %>%
arrange(-lag)
dput(del_mem, 'del_mem.R')
purchase_outliers <- dget('purchase_outliers.R')
sale_outliers <- dget('sale_outliers.R')
pout_count <- purchase_outliers %>%
group_by(Representative) %>%
count() %>%
mutate(purchases = 1) %>%
select(-n)
sout_count <- sale_outliers %>%
group_by(Representative) %>%
count() %>%
mutate(sales = 1) %>%
select(-n)
total_outliers <- outliers_df %>%
full_join(pout_count, by = c('members' = 'Representative')) %>%
full_join(sout_count, by = c('members' = 'Representative')) %>%
replace_na(list(A1 = 0, A2 = 0, A3 = 0, A4 = 0, A5 = 0, A6 = 0, purchases = 0, sales = 0)) %>%
select(-count)
# create an edge list
res <- matrix(nrow = 0, ncol = 0)
for (i in 2:ncol(total_outliers)) {
  sub <- total_outliers$members[total_outliers[, i] == 1]
  grid <- tidyr::crossing(sub, sub)
  grid <- grid[grid$sub...1 != grid$sub...2, ]
  print(grid)
  unq <- unique(grid$sub...1)
  edges <- grid[(1:length(unq)) - 1, ]
  res <- rbind(res, edges)
}
## # A tibble: 156 x 2
## sub...1 sub...2
## <chr> <chr>
## 1 David Perdue Dean Phillips
## 2 David Perdue Donna Shalala
## 3 David Perdue Gilbert Cisneros
## 4 David Perdue Greg Gianforte
## 5 David Perdue Josh Gottheimer
## 6 David Perdue Lois Frankel
## 7 David Perdue Mikie Sherrill
## 8 David Perdue Peter Meijer
## 9 David Perdue Susie Lee
## 10 David Perdue Thomas Carper
## # ... with 146 more rows
## # A tibble: 56 x 2
## sub...1 sub...2
## <chr> <chr>
## 1 Bryan Steil David Perdue
## 2 Bryan Steil Dean Phillips
## 3 Bryan Steil Gilbert Cisneros
## 4 Bryan Steil Greg Gianforte
## 5 Bryan Steil Josh Gottheimer
## 6 Bryan Steil Lloyd Doggett
## 7 Bryan Steil Susie Lee
## 8 David Perdue Bryan Steil
## 9 David Perdue Dean Phillips
## 10 David Perdue Gilbert Cisneros
## # ... with 46 more rows
## # A tibble: 42 x 2
## sub...1 sub...2
## <chr> <chr>
## 1 Dean Phillips Gilbert Cisneros
## 2 Dean Phillips Greg Gianforte
## 3 Dean Phillips Josh Gottheimer
## 4 Dean Phillips Rick Allen
## 5 Dean Phillips Susie Lee
## 6 Dean Phillips Van Taylor
## 7 Gilbert Cisneros Dean Phillips
## 8 Gilbert Cisneros Greg Gianforte
## 9 Gilbert Cisneros Josh Gottheimer
## 10 Gilbert Cisneros Rick Allen
## # ... with 32 more rows
## # A tibble: 132 x 2
## sub...1 sub...2
## <chr> <chr>
## 1 Alan Lowenthal David Perdue
## 2 Alan Lowenthal Gilbert Cisneros
## 3 Alan Lowenthal Greg Gianforte
## 4 Alan Lowenthal Josh Gottheimer
## 5 Alan Lowenthal Lois Frankel
## 6 Alan Lowenthal Pat Roberts
## 7 Alan Lowenthal Sheldon Whitehouse
## 8 Alan Lowenthal Susie Lee
## 9 Alan Lowenthal Thomas Suozzi
## 10 Alan Lowenthal Tom Malinowski
## # ... with 122 more rows
## # A tibble: 272 x 2
## sub...1 sub...2
## <chr> <chr>
## 1 Alan Lowenthal David Perdue
## 2 Alan Lowenthal Dean Phillips
## 3 Alan Lowenthal Donald Beyer
## 4 Alan Lowenthal Gilbert Cisneros
## 5 Alan Lowenthal Greg Gianforte
## 6 Alan Lowenthal Josh Gottheimer
## 7 Alan Lowenthal Kevin Hern
## 8 Alan Lowenthal Lois Frankel
## 9 Alan Lowenthal Mike Garcia
## 10 Alan Lowenthal Mo Brooks
## # ... with 262 more rows
## # A tibble: 506 x 2
## sub...1 sub...2
## <chr> <chr>
## 1 Alan Lowenthal David Perdue
## 2 Alan Lowenthal Dean Phillips
## 3 Alan Lowenthal Donald Beyer
## 4 Alan Lowenthal Gilbert Cisneros
## 5 Alan Lowenthal Greg Gianforte
## 6 Alan Lowenthal James Inhofe
## 7 Alan Lowenthal Jim Langevin
## 8 Alan Lowenthal John Hoeven
## 9 Alan Lowenthal Josh Gottheimer
## 10 Alan Lowenthal Kelly Loeffler
## # ... with 496 more rows
## # A tibble: 506 x 2
## sub...1 sub...2
## <chr> <chr>
## 1 Adam Kinzinger Austin Scott
## 2 Adam Kinzinger Brian Mast
## 3 Adam Kinzinger David Rouzer
## 4 Adam Kinzinger Donald Beyer
## 5 Adam Kinzinger Donna Shalala
## 6 Adam Kinzinger Ed Case
## 7 Adam Kinzinger Gilbert Cisneros
## 8 Adam Kinzinger Greg Gianforte
## 9 Adam Kinzinger Josh Gottheimer
## 10 Adam Kinzinger K. Conaway
## # ... with 496 more rows
## # A tibble: 272 x 2
## sub...1 sub...2
## <chr> <chr>
## 1 Adam Kinzinger Alan Lowenthal
## 2 Adam Kinzinger Brian Mast
## 3 Adam Kinzinger Cheri Bustos
## 4 Adam Kinzinger David Perdue
## 5 Adam Kinzinger Dean Phillips
## 6 Adam Kinzinger Donald Beyer
## 7 Adam Kinzinger Gilbert Cisneros
## 8 Adam Kinzinger Greg Gianforte
## 9 Adam Kinzinger Josh Gottheimer
## 10 Adam Kinzinger Kelly Loeffler
## # ... with 262 more rows
colnames(res) <- c('source', 'target')
network <- graph_from_data_frame(d=res, directed=F)
# plot it
plot(network)
griddf <- as.data.frame(matrix(cosSparse(t(A1)), nrow = 167, ncol = 167))
#griddf <- data.frame(matrix(cosSparse(t(A2)), nrow = 167, ncol = 167))
df2 <- data.frame(member = trading_members, cluster = A2clust$cluster)
d_clust <- unique(A2clust$cluster)
colnames(griddf) <- trading_members
griddf <- griddf %>%
mutate(Rep = trading_members) %>%
select(Rep, everything())
melted <- gather(griddf, key = 'Rep2', value = 'cossim', 2:168) %>%
filter(Rep != Rep2)
heatmap <- ggplot(melted, aes(reorder(Rep, -cossim), reorder(Rep2, -cossim), fill = cossim))+
geom_tile()+
scale_fill_viridis_c(option = 'D')+
theme(axis.text = element_blank(),
axis.title = element_blank())+
labs(fill = 'Cosine\nSimilarity',
title = 'Congressional Member Cosine Similarity Heatmap')
ggplotly(heatmap)
'%!in%' <- Negate('%in%')
res_mat <- matrix(data = NA, nrow = length(d_clust), ncol = 3)
for (k in d_clust) {
  print(k)
  m_vec <- df2$member[df2$cluster == k]
  ind <- which(griddf$Rep %in% m_vec)
  col_ind <- which(colnames(griddf) %in% m_vec)
  df <- griddf[ind, c(1, col_ind)] %>%
    gather(key = 'Rep2', value = 'cossim', 2:length(c(1, col_ind))) %>%
    filter(Rep != Rep2)
  col_ind2 <- which(colnames(griddf) %!in% m_vec)
  non <- griddf[ind, col_ind2] %>%
    gather(key = 'Rep2', value = 'cossim', 2:length(col_ind2)) %>%
    filter(Rep != Rep2)
  c_vec <- c(k, mean(df$cossim), mean(non$cossim))
  print(mean(df$cossim))
  print(mean(non$cossim))
  res_mat[k + 1, ] <- c_vec
}
## [1] 1
## [1] 0.07771319
## [1] 0.08723857
## [1] 0
## [1] 0.1469784
## [1] 0.08723857
colnames(res_mat) <- c('cluster', 'wics', 'wocs')
print(res_mat)
## cluster wics wocs
## [1,] 0 0.14697843 0.08723857
## [2,] 1 0.07771319 0.08723857